92 research outputs found

    Compression of DNA sequencing data

    With the release of the latest generations of sequencing machines, the cost of sequencing a whole human genome has dropped to less than US$1,000. The potential applications in several fields lead to the forecast that the amount of DNA sequencing data will soon surpass the volume of other types of data, such as video data. In this dissertation, we present novel data compression technologies with the aim of enhancing storage, transmission, and processing of DNA sequencing data. The first contribution in this dissertation is a method for the compression of aligned reads, i.e., read-out sequence fragments that have been aligned to a reference sequence. The method improves compression by implicitly assembling local parts of the underlying sequences. Compared to the state of the art, our method achieves the best trade-off between memory usage and compressed size. Our second contribution is a method for the quantization and compression of quality scores, i.e., values that quantify the error probability of each read-out base. Specifically, we propose two Bayesian models that are used to precisely control the quantization. With our method it is possible to compress the data down to 0.15 bit per quality score. Notably, we can recommend a particular parametrization for one of our models which—by removing noise from the data as a side effect—does not lead to any degradation in the distortion metric. This parametrization achieves an average rate of 0.45 bit per quality score. The third contribution is the first implementation of an entropy codec compliant to MPEG-G. We show that, compared to the state of the art, our method achieves the best compression ranks on average, and that adding our method to CRAM would be beneficial both in terms of achievable compression and speed. 
    Finally, we provide an overview of the standardization landscape, and in particular of MPEG-G, in which our contributions have been integrated.
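    The abstract above describes quality scores as values that quantify the error probability of each read-out base, and reports rates in bits per quality score. A minimal sketch of those two notions follows, assuming the common Phred convention Q = -10 log10(p). The uniform quantizer and the function names are hypothetical illustrations only; they are not the Bayesian models proposed in the dissertation.

```python
import math
from collections import Counter


def phred_to_error_prob(q: int) -> float:
    """Phred convention (assumed here): Q = -10 * log10(p), so p = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def quantize_uniform(q: int, step: int) -> int:
    """Map a score to the centre of its bin of width `step`.

    Illustrative stand-in for a quantizer, not the method from the abstract.
    """
    return (q // step) * step + step // 2


def empirical_entropy(symbols) -> float:
    """Zeroth-order empirical entropy in bits per symbol."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


if __name__ == "__main__":
    raw = [30, 31, 32, 33, 34, 35, 36, 37]
    coarse = [quantize_uniform(q, 8) for q in raw]
    print("error probability at Q=30:", phred_to_error_prob(30))
    print("bits per score, raw:      ", empirical_entropy(raw))
    print("bits per score, quantized:", empirical_entropy(coarse))
```

    Coarser bins leave fewer distinct symbols, which lowers the entropy and hence the achievable rate, at the cost of distortion in the scores.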

    Long-Term Efficacy and Safety of Chronic Globus Pallidus Internus Stimulation in Different Types of Primary Dystonia

    Background: Deep brain stimulation (DBS) of the globus pallidus internus (GPi) offers a very promising therapy for medically intractable dystonia. However, little is known about the long-term benefit and safety of this procedure. We therefore performed a retrospective long-term analysis of 18 patients (age 12-78 years) suffering from primary generalized (9), segmental (6) or focal (3) dystonia (minimum follow-up: 36 months). Methods: Outcome was assessed using the Burke-Fahn-Marsden (BFM) scores (generalized dystonia) and the Tsui score (focal/segmental dystonia). Follow-up ranged between 37 and 90 months (mean 60 months). Results: Patients with generalized dystonia showed a mean improvement in the BFM movement score of 39.4% (range 0 to 68.8%), 42.5% (range -16.0 to 81.3%) and 46.8% (range -2.7 to 83.1%) at the 3-month, 12-month and long-term follow-up, respectively. In focal/segmental dystonia, the mean reduction in the Tsui score was 36.8% (range 0 to 100%), 65.1% (range 16.7 to 100%) and 59.8% (range 16.7 to 100%) at the 3-month, 12-month and long-term follow-up, respectively. Local infections were noted in 2 patients and hardware problems (electrode dislocation and breakage of the extension cable) in 1 patient. Conclusion: Our data showed GPi-DBS to offer a very effective and safe therapy for different kinds of primary dystonia, with a significant long-term benefit in the majority of cases. Copyright (c) 2008 S. Karger AG, Base

    Nothing to hide: An X-ray survey for young stellar objects in the Pipe Nebula

    We have previously analyzed sensitive mid-infrared observations to establish that the Pipe Nebula has a very low star-formation efficiency. That study focused on YSOs with excess infrared emission (i.e., protostars and pre-main sequence stars with disks), however, and could have missed a population of more evolved pre-main sequence stars or Class III objects (i.e., young stars with dissipated disks that no longer show excess infrared emission). Evolved pre-main sequence stars are X-ray bright, so we have used ROSAT All-Sky Survey data to search for diskless pre-main sequence stars throughout the Pipe Nebula. We have also analyzed archival XMM-Newton observations of three prominent areas within the Pipe: Barnard 59, containing a known cluster of young stellar objects; Barnard 68, a dense core that has yet to form stars; and the Pipe molecular ring, a high-extinction region in the bowl of the Pipe. We additionally characterize the X-ray properties of YSOs in Barnard 59. The ROSAT and XMM-Newton data provide no indication of a significant population of more evolved pre-main sequence stars within the Pipe, reinforcing our previous measurement of the Pipe's very low star-formation efficiency. Comment: Accepted for publication in Ap

    An Introduction to MPEG-G: The First Open ISO/IEC Standard for the Compression and Exchange of Genomic Sequencing Data

    The development and progress of high-throughput sequencing technologies have transformed the sequencing of DNA from a scientific research challenge to practice. With the release of the latest generation of sequencing machines, the cost of sequencing a whole human genome has dropped to less than US$600. Such achievements open the door to personalized medicine, where it is expected that genomic information of patients will be analyzed as a standard practice. However, the associated costs, related to storing, transmitting, and processing the large volumes of data, are already comparable to the costs of sequencing. To support the design of new and interoperable solutions for the representation, compression, and management of genomic sequencing data, the Moving Picture Experts Group (MPEG) jointly with working group 5 of ISO/TC276 'Biotechnology' has started to produce the ISO/IEC 23092 series, known as MPEG-G. MPEG-G not only offers higher levels of compression compared with the state of the art but also provides new functionalities, such as built-in support for random access in the compressed domain, support for data protection mechanisms, flexible storage, and streaming capabilities. MPEG-G only specifies the decoding syntax of compressed bitstreams, as well as a file format and a transport format. This allows for the development of new encoding solutions with higher degrees of optimization while maintaining compatibility with any existing MPEG-G decoder.

    Extracellular IgC2 Constant Domains of CEACAMs Mediate PI3K Sensitivity during Uptake of Pathogens

    Several pathogenic bacteria utilize receptors of the CEACAM family to attach to human cells. Binding to different members of this receptor family can result in uptake of the bacteria. Uptake of Neisseria gonorrhoeae, a gram-negative human pathogen, via CEACAMs found on epithelial cells, such as CEACAM1, CEA or CEACAM6, differs mechanistically from phagocytosis mediated by CEACAM3, a CEACAM family member expressed selectively by human granulocytes. We find that CEACAM1- as well as CEACAM3-mediated bacterial internalization are accompanied by a rapid increase in phosphatidylinositol-3,4,5 phosphate (PI(3,4,5)P) at the site of bacterial entry. However, pharmacological inhibition of phosphatidylinositol-3' kinase (PI3K) selectively affects CEACAM1-mediated uptake of Neisseria gonorrhoeae. Accordingly, overexpression of the PI(3,4,5)P phosphatase SHIP diminishes and expression of a constitutively active PI3K increases CEACAM1-mediated internalization of gonococci, without influencing uptake by CEACAM3. Furthermore, bacterial uptake by GPI-linked members of the CEACAM family (CEA and CEACAM6) and CEACAM1-mediated internalization of N. meningitidis by endothelial cells require PI3K activity. Sensitivity of CEACAM1-mediated uptake toward PI3K inhibition is independent of receptor localization in cholesterol-rich membrane microdomains and does not require the cytoplasmic or the transmembrane domain of CEACAM1. However, PI3K inhibitor sensitivity requires the Ig(C2)-like domains of CEACAM1, which are also present in CEA and CEACAM6, but which are absent from CEACAM3. Accordingly, overexpression of CEACAM1 Ig(C2) domains blocks CEACAM1-mediated internalization. Our results provide novel mechanistic insight into CEACAM1-mediated endocytosis and suggest that epithelial CEACAMs associate in cis with other membrane receptor(s) via their extracellular domains to trigger bacterial uptake in a PI3K-dependent manner.

    Bedingungen vorzeitiger Beendigung der Erwerbsphase: ein PLS-Modell zur Erklärung der Kausalzusammenhänge am Beispiel des Vorruhestands

    The research program of Collaborative Research Centre 186 (Sonderforschungsbereich 186) is methodologically designed to overcome the frequently encountered research-strategic separation between methods for the structural analysis of social conditions and methods for investigating individual interpretive patterns of social norms and conditions of action. Against this background, the cross-project methods working group also deals with newer methods of data collection and analysis and, as in the present case, aims to document the potential of these methods for investigating status passages beyond its immediate working context. The research question developed here in an exploratory manner, using data from a study in the chemical and paper industries, will be pursued further in subproject C4 "Abstiegskarrieren und Auffangpositionen" using data from a statutory health insurance fund.

    A comprehensive video codec comparison

    In this paper, we compare the video codecs AV1 (version 1.0.0-2242 from August 2019), HEVC (HM and x265), AVC (x264), the exploration software JEM, which is based on HEVC, and the VVC (successor of HEVC) test model VTM (version 4.0 from February 2019) under two fair and balanced configurations: All Intra for the assessment of intra coding and Maximum Coding Efficiency with all codecs being tuned for their best coding efficiency settings. VTM achieves the highest coding efficiency in both configurations, followed by JEM and AV1. The worst coding efficiency is achieved by x264 and x265, even in the placebo preset for highest coding efficiency. AV1 gained a lot in terms of coding efficiency compared to previous versions and now outperforms HM with a BD-Rate gain of 24%. VTM gains 5% over AV1 in terms of BD-Rate. Reporting separate numbers for JVET and AOM test sequences ensures that no bias in the test sequences exists. When comparing only intra coding tools, it is observed that the complexity increases exponentially for linearly increasing coding efficiency.

    GVC: efficient random access compression for gene sequence variations

    Background: In recent years, advances in high-throughput sequencing technologies have enabled the use of genomic information in many fields, such as precision medicine, oncology, and food quality control. The amount of genomic data being generated is growing rapidly and is expected to soon surpass the amount of video data. The majority of sequencing experiments, such as genome-wide association studies, have the goal of identifying variations in the gene sequence to better understand phenotypic variations. We present a novel approach for compressing gene sequence variations with random access capability: the Genomic Variant Codec (GVC). We use techniques such as binarization, joint row- and column-wise sorting of blocks of variations, as well as the image compression standard JBIG for efficient entropy coding. Results: Our results show that GVC provides the best trade-off between compression and random access compared to the state of the art: it reduces the genotype information size from 758 GiB down to 890 MiB on the publicly available 1000 Genomes Project (phase 3) data, which is 21% less than the state of the art in random-access capable methods. Conclusions: By providing the best results in terms of combined random access and compression, GVC facilitates the efficient storage of large collections of gene sequence variations. In particular, the random access capability of GVC enables seamless remote data access and application integration. The software is open source and available at https://github.com/sXperfect/gvc/
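    The abstract above names binarization and sorting of blocks of variations as preprocessing steps before entropy coding. A minimal sketch of that general idea follows, assuming a small matrix of integer allele indices. The function names and the plain lexicographic row sort are hypothetical simplifications; they are not the GVC implementation, and the JBIG coding stage is omitted.

```python
def to_bit_planes(matrix):
    """Split a matrix of small non-negative integers into binary matrices,
    one per bit position (least significant bit first)."""
    max_val = max(max(row) for row in matrix)
    n_planes = max(1, max_val.bit_length())
    return [[[(v >> b) & 1 for v in row] for row in matrix]
            for b in range(n_planes)]


def from_bit_planes(planes):
    """Inverse of to_bit_planes."""
    n_rows, n_cols = len(planes[0]), len(planes[0][0])
    return [[sum(planes[b][r][c] << b for b in range(len(planes)))
             for c in range(n_cols)] for r in range(n_rows)]


def sort_rows(plane):
    """Reorder rows so that identical or similar rows become adjacent.

    Returns the reordered plane and the permutation needed to undo it.
    Lexicographic order is a stand-in for a real similarity-based sort.
    """
    order = sorted(range(len(plane)), key=lambda i: plane[i])
    return [plane[i] for i in order], order


def unsort_rows(sorted_plane, order):
    """Restore the original row order from the stored permutation."""
    restored = [None] * len(order)
    for pos, original_index in enumerate(order):
        restored[original_index] = sorted_plane[pos]
    return restored


if __name__ == "__main__":
    genotypes = [[0, 1, 2], [0, 0, 0], [2, 1, 0]]  # toy data, not real variants
    planes = to_bit_planes(genotypes)
    reordered, order = sort_rows(planes[0])
    assert unsort_rows(reordered, order) == planes[0]
    assert from_bit_planes(planes) == genotypes
```

    The permutation has to be stored alongside the sorted block so that decoding can restore the original order, which is part of the trade-off between compression and random access.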

    GABAC: An arithmetic coding solution for genomic data

    Motivation: In an effort to provide a response to the ever-expanding generation of genomic data, the International Organization for Standardization (ISO) is designing a new solution for the representation, compression and management of genomic sequencing data: the Moving Picture Experts Group (MPEG)-G standard. This paper discusses the first implementation of an MPEG-G compliant entropy codec: GABAC. GABAC combines proven coding technologies, such as context-adaptive binary arithmetic coding, binarization schemes and transformations, into a straightforward solution for the compression of sequencing data. Results: We demonstrate that GABAC outperforms well-established (entropy) codecs in a significant set of cases and thus can serve as an extension for existing genomic compression solutions, such as CRAM. © 2019 The Author(s). Published by Oxford University Press
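    The abstract above lists binarization schemes among the building blocks that feed a binary arithmetic coder. As a hedged illustration of what a binarization does, the sketch below uses order-0 exponential Golomb codes, a widely used scheme that maps non-negative integers to bit strings. The function names are hypothetical, this is not GABAC's API, and the context-adaptive arithmetic coding stage is not shown.

```python
def exp_golomb_encode(value: int) -> str:
    """Order-0 exponential Golomb binarization of a non-negative integer.

    The codeword is the binary form of value + 1, preceded by as many
    zeros as that binary form has bits minus one.
    """
    if value < 0:
        raise ValueError("value must be non-negative")
    code = bin(value + 1)[2:]
    return "0" * (len(code) - 1) + code


def exp_golomb_decode(bits: str) -> int:
    """Decode a single order-0 exponential Golomb codeword."""
    zeros = len(bits) - len(bits.lstrip("0"))
    return int(bits[zeros:2 * zeros + 1], 2) - 1


if __name__ == "__main__":
    for v in range(6):
        word = exp_golomb_encode(v)
        assert exp_golomb_decode(word) == v
        print(v, word)
```

    Small values receive short codewords, so the resulting bins are cheap to code when the symbol distribution is skewed toward zero.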